As we can see, the most important feature for XGBClassifier is thall, with a score of 0.019. Age and sex also score quite high. All of these features are expected to have a strong impact on a patient's condition.
I have trained three different models to compare the results: RandomForestClassifier, LogisticRegression, and LogisticRegression with features transformed by StandardScaler.
The results of the two Logistic Regression models are roughly the same. Among the most important features we find sex and various values of cp and caa. However, age is not among the most important features (it was for XGBClassifier).
Some similarities can be observed between XGBClassifier and RandomForestClassifier, such as the high importance of thall. Features such as sex and age were marked as much less important.
An interesting, though rather expected, difference in feature importance shows up between the tree-based models and LogisticRegression. The two model types learn different patterns in the dataset, so importance is naturally distributed differently across the features.
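A minimal sketch of this effect, on synthetic data rather than heart.csv (all names here are illustrative): a linear model's standardized coefficient magnitudes and a tree ensemble's impurity-based importances can rank the same features quite differently.

```python
# Sketch on synthetic data: importance rankings differ by model family.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     n_informative=3, random_state=0)

rf = RandomForestClassifier(random_state=0).fit(X_demo, y_demo)
lr = LogisticRegression(max_iter=1000).fit(
    StandardScaler().fit_transform(X_demo), y_demo)

# tree ensemble: mean impurity decrease; linear model: |standardized coefficients|
rf_rank = np.argsort(rf.feature_importances_)[::-1]
lr_rank = np.argsort(np.abs(lr.coef_[0]))[::-1]
print("RandomForest ranking:      ", rf_rank)
print("LogisticRegression ranking:", lr_rank)
```

Neither ranking is "the truth"; each reflects what that model family actually uses to separate the classes.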
Let's now measure feature importance for the XGBClassifier using a different approach. This time the feature thall again receives the highest score. exng_1, which previously wasn't among the top features, is now rated the second most important. sex and age, as almost always, appear near the top.
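For context, a small sketch of what this built-in score means, on synthetic data rather than heart.csv: in scikit-learn trees, `feature_importances_` is the total impurity (Gini) decrease attributed to each feature's splits, normalized to sum to 1; XGBoost exposes an analogous per-feature score through the same attribute.

```python
# Sketch on synthetic data: impurity-based importances from a single tree.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

Xg, yg = make_classification(n_samples=300, n_features=4,
                             n_informative=2, random_state=1)
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(Xg, yg)

print(tree.feature_importances_)        # one score per feature
print(tree.feature_importances_.sum())  # normalized: sums to 1.0
```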
Once more, the results differ slightly when SHAP importance is used. thall, sex, and age achieve expectedly high scores, while oldpeak scores surprisingly higher than with any other method.
To summarise, all of these methods are useful for feature-importance analysis, but choosing the best one can be difficult. Different methods take different approaches, so it is important to understand them well in order to make the right choice for the problem at hand.
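As a self-contained sketch of the permutation idea behind dalex's `model_parts()` (synthetic data here, not heart.csv), scikit-learn offers the same model-agnostic technique via `permutation_importance`: shuffle one feature at a time and measure how much the held-out score drops.

```python
# Sketch on synthetic data: model-agnostic permutation importance.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

Xd, yd = make_classification(n_samples=400, n_features=5,
                             n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(Xd, yd, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# shuffle each feature n_repeats times and record the drop in test accuracy
result = permutation_importance(clf, X_test, y_test,
                                n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Because it only needs `predict`, the same call works unchanged for any of the four models trained below.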
!pip install pandas
!pip install plotly
!pip install seaborn
!pip install scikit-learn
!pip install xgboost
!pip install imbalanced-learn
!pip install dalex
!pip install shap
!pip install lime
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
import dalex as dx
import shap
import sklearn
# load the heart-attack dataset and one-hot encode the categorical columns
df = pd.read_csv('data/heart.csv')
categorical_cols = ['exng', 'caa', 'cp', 'restecg']
# note: the sparse= parameter was renamed to sparse_output= in scikit-learn 1.2
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False, drop='first')
df[categorical_cols] = df[categorical_cols].astype('category')
df_tr = df.copy()
ohe.fit(df_tr[categorical_cols])
df_tr[ohe.get_feature_names_out(categorical_cols)] = ohe.transform(df_tr[categorical_cols])
df_tr.drop(columns=categorical_cols, inplace=True)
X, y = df_tr.drop(columns=['output']), df_tr['output']
model1 = XGBClassifier()
model1.fit(X, y)
patients = X.iloc[[14, 108]]  # two hand-picked patients for local explanations
print(patients.index)
model1.predict_proba(patients)
explainer = dx.Explainer(model1, X, y)
explainer.model_performance()
pvi = explainer.model_parts(random_state=0)
pvi.result
pvi.plot(show=False).update_layout(autosize=False, width=600, height=450)
model2 = RandomForestClassifier()
model2.fit(X, y)
explainer2 = dx.Explainer(model2, X, y)
explainer2.model_performance()
pvi2 = explainer2.model_parts(random_state=0)
pvi2.result
pvi2.plot(show=False).update_layout(autosize=False, width=600, height=450)
model3 = LogisticRegression(max_iter=1000)
model3.fit(X, y)
explainer3 = dx.Explainer(model3, X, y)
explainer3.model_performance()
pvi3 = explainer3.model_parts(random_state=0)
pvi3.result
pvi3.plot(show=False).update_layout(autosize=False, width=600, height=450)
model4 = LogisticRegression(max_iter=1000)
sc = StandardScaler()
X_tr = pd.DataFrame(sc.fit_transform(X), index=X.index, columns=X.columns)
model4.fit(X_tr, y)
explainer4 = dx.Explainer(model4, X_tr, y)
explainer4.model_performance()
pvi4 = explainer4.model_parts(random_state=0)
pvi4.result
pvi4.plot(show=False).update_layout(autosize=False, width=600, height=450)
import plotly.express as px
px.bar(pd.DataFrame({'variable': X.columns, 'importance': model1.feature_importances_}), x='variable', y='importance', title='Gini-based Variable Importance')
explainer_shap = shap.explainers.Tree(model1, X, model_output='probability')
shap.plots.bar(explainer_shap(X))